Adding the StringEncoder transformer #1159
Conversation
Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, which raises issues with potentially sparse matrices. @jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a check on the scikit-learn version. Also, I am using tf-idf as the vectorizer; should I use something else, maybe HashingVectorizer? (Writing this down so I don't forget.)
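For context, a minimal sketch of the sparse-matrix issue (toy data; a character n-gram tf-idf similar in spirit to what the PR uses is assumed). TruncatedSVD accepts the sparse output of TfidfVectorizer directly, whereas PCA required dense input until recent scikit-learn versions, hence the version-check question:

    # TruncatedSVD works on sparse tf-idf output; PCA historically raised
    # a TypeError on sparse input ("PCA does not support sparse input").
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["red apple", "green apple", "red pear"]  # toy strings
    X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)).fit_transform(docs)
    print(X.__class__)  # a scipy sparse matrix, no densification needed
    emb = TruncatedSVD(n_components=2).fit_transform(X)
    print(emb.shape)  # (3, 2)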
I'm very happy to see this progressing. Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyperparameters).
Where can I find the benchmarks?
Actually, let's keep it simple and use the CARTE datasets, they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark. You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that has either the highest cardinality, or the most "diverse entries" in the sense of https://arxiv.org/abs/2312.09634).
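A sketch of that suggested pipeline, for a regression task (the `high_cardinality` parameter name is an assumption and may differ across skrub versions):

    # Route high-cardinality string columns through the new StringEncoder,
    # leaving the rest of the table to TableVectorizer's defaults.
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline
    from skrub import StringEncoder, TableVectorizer

    pipeline = make_pipeline(
        TableVectorizer(high_cardinality=StringEncoder(n_components=30)),
        HistGradientBoostingRegressor(),
    )
    # pipeline.fit(X, y) on one of the CARTE benchmark tables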
Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a tiny benchmark on the toxicity dataset.
It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.
That's very interesting!
I updated the doc page on the Encoders, but it was only to add the
My feeling is that OrdinalEncoder is just not that good when there is no inherent order in the feature to begin with, while strings that are similar to each other are usually related no matter how they are sliced. I think an interesting experiment would be a dictionary replacement where all strings in the starting table are replaced by random alphanumeric strings, and then checking the performance of the encoders on that. In that case, I can imagine StringEncoder would not do so well compared to OrdinalEncoder. A sketch of such a scrambling step is below.
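A hypothetical sketch of that scrambling step (the helper name and details are made up for illustration): each distinct value maps to a fixed random token, so exact matches survive but substring similarity is destroyed:

    import random
    import string

    def scramble_column(values, seed=0, length=10):
        """Map each distinct value to a fixed random alphanumeric string."""
        rng = random.Random(seed)
        alphabet = string.ascii_letters + string.digits
        mapping = {}
        for v in values:
            if v not in mapping:
                mapping[v] = "".join(rng.choices(alphabet, k=length))
        return [mapping[v] for v in values]

    print(scramble_column(["red apple", "green apple", "red apple"]))
    # identical inputs map to the same token, but similar inputs no
    # longer share any character n-grams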
Hey @rcap107! Here are a bunch of questions and nitpicks :)
if (min_shape := min(X_out.shape)) >= self.n_components:
    self.tsvd_ = TruncatedSVD(n_components=self.n_components)
    result = self.tsvd_.fit_transform(X_out)
else:
    warnings.warn(
        f"The matrix shape is {X_out.shape}, and its minimum is "
        f"{min_shape}, which is too small to fit a truncated SVD with "
        f"n_components={self.n_components}. "
        "The embeddings will be truncated by keeping the first "
        f"{self.n_components} dimensions instead."
    )
    # self.n_components can be greater than the resulting number of
    # dimensions of `result`, so self.n_components_ (set below) stores
    # the actual number of dimensions of `result`.
    result = X_out[:, : self.n_components].toarray()
Maybe L140 to L155 could be brought into a common utils with the text encoder, WDYT?
Sure, not in this PR though
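For illustration, a hypothetical shape of that shared helper (the name and signature are made up; the actual refactor is deferred to a later PR):

    import warnings

    from sklearn.decomposition import TruncatedSVD

    def fit_reduced_embeddings(X_out, n_components):
        """Fit TruncatedSVD when the matrix allows it, else truncate columns."""
        if min(X_out.shape) >= n_components:
            tsvd = TruncatedSVD(n_components=n_components)
            return tsvd, tsvd.fit_transform(X_out)
        warnings.warn(
            f"Matrix shape {X_out.shape} is too small for a truncated SVD "
            f"with n_components={n_components}; keeping the first "
            f"{n_components} dimensions instead."
        )
        # mirrors the PR code above, which assumes a sparse X_out here
        return None, X_out[:, :n_components].toarray()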
Co-authored-by: Vincent M <[email protected]>
OpenML downloads still fail and break the CI
LGTM!! Thanks a lot @rcap107 this is a great addition 🚀
Looks good! Thanks @rcap107 :)
Fine by me!
+1
This is a first draft of a PR to address #1121
I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.
Things left to do: